ABSTRACT

The existing solutions to privacy preserving publication can be
classified into the theoretical and heuristic categories. The former
guarantees provably low information loss, whereas the latter incurs
gigantic loss in the worst case, but is shown empirically to perform
well on many real inputs. While numerous heuristic algorithms have
been developed to satisfy advanced privacy principles such as
l-diversity, t-closeness, etc., the theoretical category is currently
limited to k-anonymity, which is the earliest principle and is known
to have severe vulnerability to privacy attacks. Motivated by this,
we present the first theoretical study on l-diversity, a popular
principle that is widely adopted in the literature. First, we show
that optimal l-diverse generalization is NP-hard even when there are
only 3 distinct sensitive values in the microdata. Then, an
(l · d)-approximation algorithm is developed, where d is the
dimensionality of the underlying dataset. This is the first known
algorithm with a non-trivial bound on information loss. Extensive
experiments with real datasets validate the effectiveness and
efficiency of the proposed solution.

1. INTRODUCTION 

Privacy preserving publication has become an active topic in
databases. An important problem is the prevention of linking
attacks [38,43]. To explain this threat, assume that a hospital
releases the patients' details in Table 1, called the microdata, to
medical researchers. Disease is a sensitive attribute (SA) because a
patient's disease is regarded as her/his privacy. Attribute Name is
not part of the table, but it will be used to facilitate tuple
referencing. Consider an adversary that knows (i) the age (< 30),
gender (M) and education level (bachelor) of Calvin, and (ii) that
Calvin has a record in the microdata. Thus, s/he easily finds out
that Tuple 3 is Calvin's record and hence, that Calvin contracted
pneumonia.
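The attack just described amounts to a selection on the QI columns.
The following is a minimal sketch, assuming the rows of Table 1 are
encoded as Python tuples; the names MICRODATA and link are
illustrative and do not come from the paper.

```python
# Rows of Table 1: (tuple_id, age, gender, education, disease).
MICRODATA = [
    (1, "<30", "M", "Master", "HIV"),
    (2, "<30", "M", "Master", "HIV"),
    (3, "<30", "M", "Bachelor", "pneumonia"),
    (4, "[30, 50)", "M", "Bachelor", "bronchitis"),
    (5, "[30, 50)", "F", "Bachelor", "pneumonia"),
    (6, "[30, 50)", "F", "Bachelor", "bronchitis"),
    (7, "[30, 50)", "F", "Bachelor", "bronchitis"),
    (8, "[30, 50)", "F", "Bachelor", "pneumonia"),
    (9, ">50", "F", "High Sch.", "dyspepsia"),
    (10, ">50", "F", "High Sch.", "pneumonia"),
]


def link(known_qi, table):
    """Return the rows whose (age, gender, education) equal known_qi."""
    return [row for row in table if row[1:4] == known_qi]


# The adversary's knowledge about Calvin (hypothetical encoding).
calvin_matches = link(("<30", "M", "Bachelor"), MICRODATA)
```

A single matching row means that the individual is re-identified and
the Disease value of that row is disclosed.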

Tuple ID (Name)  Age       Gender  Education  Disease
---------------  --------  ------  ---------  ----------
1 (Adam)         <30       M       Master     HIV
2 (Bob)          <30       M       Master     HIV
3 (Calvin)       <30       M       Bachelor   pneumonia
4 (Danny)        [30, 50)  M       Bachelor   bronchitis
5 (Eva)          [30, 50)  F       Bachelor   pneumonia
6 (Fiona)        [30, 50)  F       Bachelor   bronchitis
7 (Ginny)        [30, 50)  F       Bachelor   bronchitis
8 (Helen)        [30, 50)  F       Bachelor   pneumonia
9 (Ivy)          >50       F       High Sch.  dyspepsia
10 (Jane)        >50       F       High Sch.  pneumonia

Table 1: The microdata

In the above attack, columns Age, Gender, and Education are
quasi-identifier (QI) attributes because they can be combined to
reveal an individual's identity. The cause of privacy leakage is that
an individual (e.g., Calvin) may have a unique set of QI values. A
common approach for fixing this problem is generalization, which
partitions the microdata into QI-groups, and then converts the QI
values in each group to the same form, e.g., replaces distinct values
on each QI attribute with stars. For example, Table 2 shows a
generalization of Table 1 based on a partition of four QI-groups.
Notice that, in the second QI-group, the ages of Tuples 3 and 4 have
been suppressed into stars, since their original values are
different.

Tuple ID (Name)  Age       Gender  Education  Disease     QI-group
---------------  --------  ------  ---------  ----------  --------
1 (Adam)         <30       M       Master     HIV         1
2 (Bob)          <30       M       Master     HIV         1
3 (Calvin)       *         M       Bachelor   pneumonia   2
4 (Danny)        *         M       Bachelor   bronchitis  2
5 (Eva)          [30, 50)  F       Bachelor   pneumonia   3
6 (Fiona)        [30, 50)  F       Bachelor   bronchitis  3
7 (Ginny)        [30, 50)  F       Bachelor   bronchitis  3
8 (Helen)        [30, 50)  F       Bachelor   pneumonia   3
9 (Ivy)          >50       F       High Sch.  dyspepsia   4
10 (Jane)        >50       F       High Sch.  pneumonia   4

Table 2: 2-anonymous publication

A generalized table can be released if it satisfies an anonymization
principle, which determines the quality of privacy protection. The
earliest principle is k-anonymity [38,43], which requires each
QI-group to contain at least k tuples. As a result, each tuple
carries the same QI values as at least k − 1 other tuples. For
instance, Table 2 is 2-anonymous. Given this table, the adversary
mentioned earlier cannot tell whether Tuple 3 or 4 belongs to Calvin.

Machanavajjhala et al. [31] observe that k-anonymity suffers from
the homogeneity problem: a QI-group may have too many tuples with the
same SA (sensitive attribute) value. For example, both tuples in the
first QI-group of Table 2 have HIV. As a result, an adversary having
the QI particulars of Adam (or Bob) can assert that Adam (Bob) has
HIV, without having to identify the tuple owned by Adam (Bob). Note
that the problem cannot be eliminated by increasing k, because
k-anonymity places no constraint on the SA values in each QI-group.
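The gap between k-anonymity and protection of SA values can be
checked mechanically. The sketch below assumes each QI-group of
Table 2 is represented only by the Disease values of its tuples; the
function names are illustrative, not taken from the paper.

```python
from collections import Counter

# QI-groups of Table 2, reduced to the Disease values of their tuples.
TABLE2_GROUPS = [
    ["HIV", "HIV"],                                          # Tuples 1-2
    ["pneumonia", "bronchitis"],                             # Tuples 3-4
    ["pneumonia", "bronchitis", "bronchitis", "pneumonia"],  # Tuples 5-8
    ["dyspepsia", "pneumonia"],                              # Tuples 9-10
]


def is_k_anonymous(groups, k):
    """k-anonymity only constrains the size of every QI-group."""
    return all(len(group) >= k for group in groups)


def worst_case_confidence(groups):
    """Largest fraction that a single SA value reaches inside one group."""
    return max(max(Counter(g).values()) / len(g) for g in groups)
```

On Table 2 the size test passes for k = 2, yet the worst-case
confidence is 1.0 because the first group contains only HIV.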

The above problem has led to the development of numerous SA-aware
principles, which set forth conditions to be fulfilled by the SA
values in each QI-group. Among the existing principles,
l-diversity [31] is the most widely deployed [16,23,31,46,47,49],
due to its simplicity and good privacy guarantee. Specifically, this
principle demands¹ that, in each QI-group, at most 1/l of its tuples
can have an identical SA value. Table 3 demonstrates a 2-diverse
generalization of Table 1. It can be easily verified that, in each
QI-group, the frequency of each SA value is at most 50%. Thus, even
if an adversary figures out the QI-group containing the record of an
individual, s/he can determine the real SA value of the individual
with no more than 50% confidence.

Tuple ID (Name)  Age       Gender  Education  Disease     QI-group
---------------  --------  ------  ---------  ----------  --------
1 (Adam)         *         M       *          HIV         1
2 (Bob)          *         M       *          HIV         1
3 (Calvin)       *         M       *          pneumonia   1
4 (Danny)        *         M       *          bronchitis  1
5 (Eva)          [30, 50)  F       Bachelor   pneumonia   2
6 (Fiona)        [30, 50)  F       Bachelor   bronchitis  2
7 (Ginny)        [30, 50)  F       Bachelor   bronchitis  2
8 (Helen)        [30, 50)  F       Bachelor   pneumonia   2
9 (Ivy)          >50       F       High Sch.  dyspepsia   3
10 (Jane)        >50       F       High Sch.  pneumonia   3

Table 3: 2-diverse publication

1.1 Theory Stops at k-anonymity

The goal of privacy preserving publication is to minimize the
information loss (e.g., the number of stars used in Tables 2 and 3)
in enforcing the selected anonymization principle. The existing
solutions can be divided into two categories: theoretical and
heuristic. The former develops algorithms with worst-case performance
bounds. The latter, on the other hand, designs algorithms that work
well on many real datasets, but may have very poor performance
(i.e., incur gigantic information loss) on "unfriendly" inputs.

We notice that, in terms of privacy protection, the theoretical
category significantly lags behind its heuristic counterpart. As
reviewed in Section 2, many heuristic algorithms exist for various
SA-aware principles that ensure strong privacy preservation. However,
as surveyed next, all the theoretical results concern only
k-anonymity, and none of them deals with SA-aware principles. In
other words, currently all the theoretical algorithms suffer from the
homogeneity problem mentioned earlier, and thus are weak in privacy
guarantees. Note that this drawback also reduces the practical
usefulness of their nice bounds on information loss, because a
publisher puts privacy at a higher priority than utility.

In the theoretical category, Meyerson and Williams [33] are the
first to establish the complexity of optimal k-anonymity, by showing
that it is NP-hard to compute a k-anonymous table that contains the
minimum number of stars (i.e., suppressed values). They also provide
an O(k log k)-approximation algorithm. Aggarwal et al. [5] offer a
stronger NP-hardness proof that requires a smaller domain of the QI
attributes. They also improve the approximation ratio to O(k). Park
and Shim [35] enhance the ratio further to O(log k). It should be
noted that the algorithms in [33,35] have running time exponential
in k, while the running time of the algorithm in [5] is a polynomial
of k and n. Du et al. [13] consider the case when the generalized
table is produced not by replacing QI values with stars, but by
applying multi-dimensional generalization (see Section 2). They show
that enforcing k-anonymity in this setting is still NP-hard, and
give an O(d)-approximation algorithm, where d is the number of QI
attributes. Aggarwal et al. [4] propose clustering-based
generalization, prove the NP-hardness of k-anonymity (in [4], k is
replaced by r), and provide constant approximation solutions.

¹ Precisely speaking, l-diversity requires each QI-group to have at
least l well-represented values. There are different interpretations
of "well-represented" [31]. The version discussed here is widely
adopted in the literature [16,45,47].



1.2 Our Results 

This paper presents the first theoretical study on l-diverse
anonymization. In particular, we consider that the microdata is
anonymized by suppressing QI values, and we aim at achieving
l-diversity with the minimum number of stars. At first glance, a
simple reduction from k-anonymity seems to establish the NP-hardness
of l-diversity. Specifically, given a table where no two tuples have
the same SA value, the optimal l-diverse generalization is also the
optimal "l-anonymous" generalization (i.e., k-anonymity with k = l).
Hence, if there were an optimal l-diversity algorithm that runs in
polynomial time, the same algorithm could efficiently solve optimal
k-anonymity as well, which contradicts the NP-hardness of optimal
k-anonymity.

The previous reduction requires that the number m of distinct SA
values is as large as the cardinality n of the microdata. Therefore,
a natural question is whether optimal l-diversity can be settled in
polynomial time if m < n, as is true in practice. In fact, the
answer is "yes" for m = 2, in which case the problem becomes
bipartite matching (see Section 4), a well-known polynomial-time
solvable problem. The first major contribution of our work is a proof
showing that l-diversity is NP-hard as long as m ≥ 3. Clearly, this
result is much stronger than the hardness result from the earlier
simple reduction. In fact, our result still holds even if the
alphabet (i.e., the domain union of all attributes) has a size of
only m + 1.

On the algorithm side, we propose a solution that ensures an
approximation ratio of l · d, where d is the number of QI attributes
in the microdata. This is the first algorithm on l-diversity with a
non-trivial bound on information loss. Furthermore, our algorithm is
also highly efficient: it runs in close-to-linear time. Although the
l · d approximation ratio may trigger concerns about the usefulness
of our technique in practice, we note that the actual performance of
our algorithm is much better than the theoretical bound.
Specifically, our algorithm executes in three phases and, depending
on the dataset characteristics, may finish in any phase. Termination
in the first phase results in a d-approximate solution, while
termination in the second phase incurs at most l · d additional stars
(with respect to the d-approximation). In any case, an
(l · d)-approximate solution is guaranteed after the third phase. On
the large set of datasets tested in our experiments, our algorithm
always terminates before the third phase, thus achieving an
approximation ratio of d. In addition, our algorithm can be further
improved when we combine it with a heuristic-based l-diversity
solution. Empirical evaluation shows that such a hybrid method
significantly outperforms the existing l-diversity algorithms, in
terms of the number of stars required in anonymization.

The rest of the paper is organized as follows. Section 2 surveys
the previous work relevant to ours. Section 3 formally defines the
problem. Section 4 establishes the hardness of the problem, while
Section 5 presents our approximation algorithm and proves its quality
guarantees. Section 6 experimentally evaluates the effectiveness and
efficiency of the proposed technique. Finally, Section 7 concludes
the paper with directions for future work.

2. RELATED WORK 

The existing theoretical results on privacy preserving publication
have been explained in Section 1.1. In the following, we focus on
other approaches based on four categories.

Anonymization methodologies. Most of the existing work on microdata
publication adopts generalization to anonymize data. There exist
three variations of generalization, namely, suppression [1],
single-dimensional generalization [7,15,20,46,50], and
multi-dimensional generalization [16,27,28]. Suppression replaces
distinct QI values in each QI-group with stars, as demonstrated in
Section 1. Single-dimensional generalization, on the other hand,
divides the domain of each QI attribute into disjoint sub-domains,
and maps each QI value in the microdata to the sub-domain that
contains the value, i.e., it "coarsens" the domains of the QI
attributes. For example, Table 4 illustrates a single-dimensional
generalization of Table 1 that satisfies 2-diversity. In particular,
the domain of Age (Education) is divided into two sub-domains, "<50"
and ">50" ("High school or below" and "Bachelor or above").
Multi-dimensional generalization is an extension of
single-dimensional generalization: it allows QI values to be mapped
to overlapping sub-domains. For instance, Table 5 shows a 2-diverse
multi-dimensional generalization of Table 1.

T-ID (Name)  Age  Gender  Education           Disease     QI-group
-----------  ---  ------  ------------------  ----------  --------
1 (Adam)     <50  M       Bachelor or above   HIV         1
2 (Bob)      <50  M       Bachelor or above   HIV         1
3 (Calvin)   <50  M       Bachelor or above   pneumonia   1
4 (Danny)    <50  M       Bachelor or above   bronchitis  1
5 (Eva)      <50  F       Bachelor or above   pneumonia   2
6 (Fiona)    <50  F       Bachelor or above   bronchitis  2
7 (Ginny)    <50  F       Bachelor or above   bronchitis  2
8 (Helen)    <50  F       Bachelor or above   pneumonia   2
9 (Ivy)      >50  F       High Sch. or below  dyspepsia   3
10 (Jane)    >50  F       High Sch. or below  pneumonia   3

Table 4: Single-dimensional generalization

T-ID (Name)  Age       Gender  Education          Disease     QI-group
-----------  --------  ------  -----------------  ----------  --------
1 (Adam)     <50       M       Bachelor or above  HIV         1
2 (Bob)      <50       M       Bachelor or above  HIV         1
3 (Calvin)   <50       M       Bachelor or above  pneumonia   1
4 (Danny)    <50       M       Bachelor or above  bronchitis  1
5 (Eva)      [30, 50)  F       Bachelor           pneumonia   2
6 (Fiona)    [30, 50)  F       Bachelor           bronchitis  2
7 (Ginny)    [30, 50)  F       Bachelor           bronchitis  2
8 (Helen)    [30, 50)  F       Bachelor           pneumonia   2
9 (Ivy)      >50       F       High Sch.          dyspepsia   3
10 (Jane)    >50       F       High Sch.          pneumonia   3

Table 5: Multi-dimensional generalization

As multi-dimensional generalization imposes fewer constraints on
how the QI values should be transformed, it can retain more
information in the anonymized tables than suppression and
single-dimensional generalization. For example, it can be verified
that each QI value in Table 5 is equally or more accurate than the
corresponding value in Table 3 or 4. However, suppression and
single-dimensional generalization have a significant advantage over
multi-dimensional generalization: the anonymized data they produce
can be directly used by off-the-shelf software (e.g., SAS [39],
SPSS [40], Stata [41]) designed for microdata analysis. Specifically,
tables with suppressed values can be processed as microdata with
missing entries. On the other hand, any single-dimensional
generalization can be treated as a microdata table defined over
attributes with coarsened domains, i.e., all analyses on the data are
performed by regarding each sub-domain of a QI attribute as a unit
value.

In contrast, multi-dimensional generalization results in data that
cannot be handled by existing software, due to the complex
relationships among QI values represented by overlapping
sub-domains. To understand this, consider that a user wants to count
the number of individuals in Table 5 with ages in [30, 50). In that
case, the user has to take into account not only the tuples with an
Age value [30, 50), but also those with a value "<50", which is
non-trivial since it is difficult to decide how those tuples may
contribute to the query result. In general, performing analysis
(e.g., regression, classification) on overlapping sub-domains is
highly complicated, and hence is not supported by off-the-shelf
statistical software. This explains why existing anonymization
systems, like μ-Argus [18] and Datafly [42], adopt suppression and
single-dimensional generalization instead of multi-dimensional
generalization.
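The counting difficulty can be made concrete with a small sketch. It
assumes, purely for illustration, that each generalized Age value of
Table 5 is encoded as a half-open interval [lo, hi), with "<50"
written as [0, 50) and ">50" as [50, inf); the encoding and the
function name are not part of the paper.

```python
import math

# Age column of Table 5: four "<50", four "[30, 50)", two ">50".
TABLE5_AGES = [(0, 50)] * 4 + [(30, 50)] * 4 + [(50, math.inf)] * 2


def count_bounds(intervals, q_lo, q_hi):
    """Bounds on how many tuples have an age inside [q_lo, q_hi).

    A tuple certainly qualifies if its interval lies inside the query
    range, and possibly qualifies if its interval overlaps the range.
    """
    certain = sum(1 for lo, hi in intervals if q_lo <= lo and hi <= q_hi)
    possible = sum(1 for lo, hi in intervals if lo < q_hi and q_lo < hi)
    return certain, possible


bounds = count_bounds(TABLE5_AGES, 30, 50)
```

For the query [30, 50) the answer can only be bracketed between 4 and
8, because the four "<50" tuples overlap the range without being
contained in it.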

In summary, suppression and single-dimensional generalization are
preferable if the data publisher aims to release data that can be
easily used by ordinary users; otherwise, multi-dimensional
generalization can be adopted. An interesting question is, how does
suppression compare with single-dimensional generalization? To answer
this question, in Section 6.2 we will experimentally evaluate our
suppression algorithms against the existing single-dimensional
generalization methods.

Besides generalization, there also exist other methodologies for
privacy preserving data publication. Kifer and Gehrke [23] propose
marginal publication, which releases different projections of the
microdata onto various sets of attributes. Xiao and Tao [47] advocate
anatomy, which publishes QI and SA values directly in separate
tables. Aggarwal and Yu [3] design the condensation method, which
releases only selected statistics about each QI-group. Rastogi et
al. [36] employ the perturbation approach.

Anonymization principles. Privacy protection must take into account
the knowledge of adversaries. A common assumption is that an
adversary has the precise QI values of all individuals in the
microdata. Indeed, these values can be obtained, for example, by
knowing a person or consulting an external source such as a voter
registration list [43].

Under this assumption, both k-anonymity and l-diversity aim at
preventing the accurate inference of individuals' SA values. Many
other principles share this objective. (α, k)-anonymity [46]
combines the previous two principles: each QI-group must have size
at least k, and at most α percent of its tuples can have the same SA
value. m-invariance [49] is a stricter version of l-diversity, which
dictates that each group have exactly m tuples with different SA
values. The personalized approach [48] allows each individual to
specify her/his own degree of privacy preservation. The above
principles deal with categorical SAs, whereas (k, e)-anonymity [51]
and t-closeness [29] support numerical ones. (k, e)-anonymity demands
that each QI-group have size at least k, and that the largest and
smallest SA values in a group differ by at least e. t-closeness
requires that the SA-distribution in each QI-group not deviate from
that of the whole microdata by more than t.

δ-presence [34] assumes the same background knowledge as the
earlier principles, but ensures a different type of privacy: it
prevents an adversary from knowing whether an individual has a record
in the microdata. (c, k)-safety [32] tackles stronger background
knowledge. In addition to individuals' QI values, an adversary may
have several pieces of implicational knowledge of the form "if
person o_1 has sensitive value v_1, then another person o_2 has
sensitive value v_2". (c, k)-safety guarantees that, if an adversary
has at most k pieces of such knowledge, s/he will not be able to
infer any individual's SA value with a confidence higher than c.
With a similar purpose, skyline privacy [10] guards against an extra
type of knowledge: an adversary may already know the sensitive values
of some individuals before inspecting the published contents.

Generalization algorithms. Numerous heuristic algorithms have been
developed to compute generalizations with small information loss.
Although they offer no provably good worst-case quality or
complexity guarantees, these algorithms are general, since they can
be applied to many of the anonymization principles reviewed earlier,
and work with both numerical and categorical domains. Specifically, a
genetic algorithm is developed in [20], and the branch-and-bound
paradigm is employed on a set-enumeration tree in [7,30]. Top-down
and bottom-up algorithms are presented in [15,50], and the method
in [26] borrows ideas from frequent item set mining. While all the
above algorithms adopt single-dimensional generalization, there also
exist several multi-dimensional generalization methods. In [27], an
algorithm is developed based on a partitioning approach reminiscent
of kd-trees. This algorithm is further improved in [28] to optimize
anonymized data for given workloads. In [16], space filling curves
are leveraged to facilitate generalization, and the work of [19]
draws an analogy between spatial indexing and generalization. As
shown in [45], the previous algorithms may suffer from minimality
attacks, which can be avoided by introducing some randomization.

Anonymity in other contexts. The earlier discussion focuses on data
publication, whereas anonymity issues arise in many other
environments. Some examples include anonymized surveying [6,14],
statistical databases [9], cryptographic computing [21], access
control [8], and so on.

3. PROBLEM DEFINITIONS 

Let T be the raw microdata table, which has d quasi-identifier (QI)
attributes A_1, ..., A_d, and a sensitive attribute (SA) B. Here, d
is the dimensionality of T, and all attributes are categorical.
Given a tuple t ∈ T, we employ t[A_i] to denote its i-th
(1 ≤ i ≤ d) QI value, and t[B] its sensitive value. Use n to
represent the cardinality of T, and m to represent the number of
distinct sensitive values in T. Without loss of generality, we
assume that all SA values are from the integer domain
[m] = {1, ..., m}.

As in most previous work on theoretical generalization algorithms,
we assume that T is anonymized with suppression, which can be
formally defined based on the concept of partition. Specifically, a
partition P of T consists of disjoint subsets of T whose union
equals T. We refer to each subset as a QI-group. P determines an
anonymization T* of T, where all tuples in the same QI-group carry
the same QI values, as shown next.

DEFINITION 1 (GENERALIZATION). A partition P of T defines a
generalization T* of T as follows. For each QI-group in P, if all
the tuples in the group have the same value on A_i (i ∈ [1, d]),
then they keep this value in T*; otherwise, their A_i values are
replaced with '*'. All tuples in T retain their SA values in T*.

For example, Table 2 (or 3) is a generalization of Table 1
determined by a partition with 4 (3) QI-groups. As long as one QI
value of a tuple is changed to a star, we say that this tuple has
been suppressed.

DEFINITION 2 (l-DIVERSITY). Given an integer l, a set S of tuples
is l-eligible if at most |S|/l of the tuples have an identical SA
value. A generalization T* is l-diverse if each QI-group is
l-eligible.
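Definition 2 translates directly into a frequency test. The sketch
below is one possible rendering, assuming a QI-group is given as the
list of SA values of its tuples; the helper names and the encoding of
Table 3 are illustrative.

```python
from collections import Counter


def is_l_eligible(sa_values, l):
    """Definition 2: no SA value occurs in more than |S|/l tuples."""
    counts = Counter(sa_values)
    # Comparing c * l with |S| avoids floating-point division.
    return all(c * l <= len(sa_values) for c in counts.values())


def is_l_diverse(groups, l):
    """A generalization is l-diverse if every QI-group is l-eligible."""
    return all(is_l_eligible(group, l) for group in groups)


# QI-groups of Table 3, reduced to the Disease values of their tuples.
TABLE3_GROUPS = [
    ["HIV", "HIV", "pneumonia", "bronchitis"],
    ["pneumonia", "bronchitis", "bronchitis", "pneumonia"],
    ["dyspepsia", "pneumonia"],
]
```

With this encoding, Table 3 passes the test for l = 2 and fails it
for l = 3, since its last QI-group has only two tuples.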

We are ready to define the problem of optimal l-diverse
generalization.

PROBLEM 1 (STAR MINIMIZATION). Given a microdata table T and an
integer l, find an optimal l-diverse generalization of T, i.e., one
that has the smallest number of stars.

Note that there may be multiple optimal solutions with the same
number of stars. An important property of l-diversity is
monotonicity:

LEMMA 1 ([31]). Let S_1 and S_2 be two disjoint sets of tuples. If
both of them are l-eligible, then so is S_1 ∪ S_2.
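One way to see why Lemma 1 holds is a direct counting argument (a
sketch only; the proof itself is given in [31]). Write c_1(v) and
c_2(v) for the number of tuples with SA value v in S_1 and S_2,
respectively; this notation is introduced here for illustration.
Then, for every SA value v,

```latex
\[
c_1(v) + c_2(v)
\;\le\; \frac{|S_1|}{l} + \frac{|S_2|}{l}
\;=\; \frac{|S_1| + |S_2|}{l}
\;=\; \frac{|S_1 \cup S_2|}{l},
\]
```

where the first step uses the l-eligibility of S_1 and S_2, and the
last step uses the fact that S_1 and S_2 are disjoint.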

As an immediate corollary, Problem 1 has a solution if and only if
T itself is l-eligible, i.e., at most |T|/l tuples of T carry the
same sensitive value. In the following, we focus on only such
microdata tables. It follows that m ≥ l, where m is the number of
distinct sensitive values in T, as mentioned before.

A close companion of star minimization (Problem 1) is tuple
minimization:

PROBLEM 2 (TUPLE MINIMIZATION). Given a microdata table T and an
integer l, find an optimal l-diverse generalization T* of T, i.e.,
one that suppresses the smallest number of tuples.

For instance, in Table 3, the amount of information loss is 8
(stars) in Problem 1, but 4 (tuples) in Problem 2. Tuple
minimization is different from star minimization because suppressing
various tuples may require different numbers of stars. The following
result builds a connection between the two problems.
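Definition 1 and the two loss measures can be sketched in a few
lines. The code assumes the QI columns of Table 1 are stored in a
dictionary keyed by tuple ID and that a partition is a list of lists
of tuple IDs; these representations and all function names are
illustrative choices, not the paper's.

```python
# QI values (Age, Gender, Education) of Table 1, keyed by tuple ID.
QI = {
    1: ("<30", "M", "Master"),
    2: ("<30", "M", "Master"),
    3: ("<30", "M", "Bachelor"),
    4: ("[30, 50)", "M", "Bachelor"),
    5: ("[30, 50)", "F", "Bachelor"),
    6: ("[30, 50)", "F", "Bachelor"),
    7: ("[30, 50)", "F", "Bachelor"),
    8: ("[30, 50)", "F", "Bachelor"),
    9: (">50", "F", "High Sch."),
    10: (">50", "F", "High Sch."),
}

TABLE2_PARTITION = [[1, 2], [3, 4], [5, 6, 7, 8], [9, 10]]
TABLE3_PARTITION = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]


def generalize(qi, partition):
    """Definition 1: in each group, keep a column only if all tuples agree."""
    generalized = {}
    for group in partition:
        columns = list(zip(*(qi[t] for t in group)))
        pattern = tuple(c[0] if len(set(c)) == 1 else "*" for c in columns)
        for t in group:
            generalized[t] = pattern
    return generalized


def count_stars(generalized):
    """Objective of Problem 1."""
    return sum(row.count("*") for row in generalized.values())


def count_suppressed_tuples(generalized):
    """Objective of Problem 2: tuples with at least one star."""
    return sum(1 for row in generalized.values() if "*" in row)
```

Applied to the partition behind Table 3, the two counters give 8
stars and 4 suppressed tuples, matching the figures quoted above.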

LEMMA 2. A λ-approximate solution to Problem 2 is a
(λ · d)-approximate solution to Problem 1.

PROOF. Let T_1* and T_2* be optimal solutions to Problems 1 and 2,
respectively, and let T_3* be a λ-approximate solution to Problem 2.
Use α_1 and β_1 to denote the number of stars and the number of
tuples suppressed in T_1*, respectively. Define α_2, β_2 and α_3,
β_3 in the same way for T_2* and T_3*, respectively. Since each
suppressed tuple introduces between 1 and d stars, it holds that
β_i ≤ α_i ≤ d · β_i for i = 1, 2, 3. Hence,
α_3 ≤ d · β_3 ≤ λ · d · β_2 ≤ λ · d · β_1 ≤ λ · d · α_1. □

In the following sections, we will show that star minimization is
NP-hard when m ≥ 3, and then approach this problem through tuple
minimization.

4. HARDNESS OF STAR MINIMIZATION 

As discussed in Section 1.2, there exists a straightforward
reduction from k-anonymity to l-diversity. This reduction, however,
works only when the number m of distinct sensitive values in the
microdata table T equals the number of tuples in T. It is natural to
wonder, in the more realistic scenario m ≪ |T|, is star
minimization (Problem 1) still NP-hard?

It is easy to observe a polynomial-time algorithm for m = 2. In
this case, since l ≤ m, the value of l must be 2 (l = 1 is useless
for anonymization). Let S_1 (S_2) be the set of tuples having the
first (second) SA value. Thus, |S_1| = |S_2| = |T|/2; otherwise, T
is not 2-eligible and Problem 1 has no solution. Then, there exists
a 2-diverse optimal generalization where each QI-group has 2 tuples,
since any 2-diverse QI-group of T with more than 2 tuples can be
divided into smaller 2-diverse QI-groups, without increasing the
number of stars in the generalization. Finding this generalization
is an instance of bipartite matching. Specifically, we create a
bipartite graph by treating S_1 and S_2 as sets of vertices. Draw an
edge between each pair (t_1, t_2) ∈ S_1 × S_2. The edge has a weight
equal to the number of stars needed to generalize t_1 and t_2 into
the same form. No edge exists between vertices from the same set. An
optimal 2-diverse generalization corresponds to a minimum-weight
perfect matching between S_1 and S_2, which can be found in
O(|T|^3) time [24].
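The matching formulation can be illustrated on a toy instance. The
sketch below is not the polynomial-time algorithm cited above: it
enumerates all perfect matchings by brute force, which takes
factorial time and only serves to show the edge weights and the
objective. The input tuples and function names are hypothetical.

```python
from itertools import permutations


def stars_to_merge(t1, t2):
    """Edge weight: every differing QI column costs one star per tuple."""
    return 2 * sum(1 for x, y in zip(t1, t2) if x != y)


def best_pairing(s1, s2):
    """Minimum-weight perfect matching between s1 and s2 by enumeration."""
    if len(s1) != len(s2):
        raise ValueError("T is not 2-eligible: |S_1| != |S_2|")
    best_cost, best_perm = None, None
    for perm in permutations(range(len(s2))):
        cost = sum(stars_to_merge(s1[i], s2[j]) for i, j in enumerate(perm))
        if best_cost is None or cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, list(enumerate(best_perm))


# QI values of the tuples carrying the first and second SA value.
S1 = [("<30", "M"), ("[30, 50)", "F")]
S2 = [("[30, 50)", "F"), ("<30", "M")]
cost, pairs = best_pairing(S1, S2)
```

Each returned pair (i, j) forms one 2-tuple QI-group made of S1[i]
and S2[j]; in practice the enumeration would be replaced by a
polynomial-time assignment algorithm.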

The above observation also has another implication. In [46], the
authors prove that (α, k)-anonymity (explained in Section 2) is
NP-hard. They do so by showing that (0.5, k)-anonymity is NP-hard.
Recall that (0.5, k)-anonymity is essentially the combination of
k-anonymity and 2-diversity. Intuitively, the hardness of
(0.5, k)-anonymity stems from the difficulty of k-anonymity. Indeed,
the proof in [46] no longer holds when k-anonymity is not required.

p_1 = (1, a, δ), p_2 = (1, b, γ), p_3 = (2, c, α),
p_4 = (2, b, α), p_5 = (3, b, γ), p_6 = (4, d, β)

(a) The contents of S

(b) The constructed table T (m = 8)

Figure 1: Illustration of reduction

Next, first assuming l = 3, we establish the NP-hardness of star
minimization for any m ≥ l. Later, we will extend the analysis to
any l > 3. Our derivation is based on a reduction from a classical
NP-hard problem, 3-dimensional matching (3DM) [22]. Specifically,
let D_1, D_2, D_3 be three dimensions with disjoint domains, and
these domains are equally large: |D_1| = |D_2| = |D_3| = n. The
input is a set S of d ≥ n distinct 3D points p_1, ..., p_d in the
space D_1 × D_2 × D_3. The goal of 3DM is to decide the existence of
an S' ⊆ S such that |S'| = n and no two points in S' share the same
coordinate on any dimension. For example, assume D_1 = {1, 2, 3, 4},
D_2 = {a, b, c, d}, D_3 = {α, β, γ, δ}, and a set S of 6 points as
in Figure 1(a). Then, the result of 3DM is "yes": a solution S' can
be {p_1, p_3, p_5, p_6}.
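The example can be verified with a short exhaustive search. The
sketch assumes the six points of Figure 1(a) with the Greek
coordinates spelled out as strings; the search is exponential in
general (3DM is NP-hard) and is only meant for this small instance.
Function names are illustrative.

```python
from itertools import combinations

# Points of Figure 1(a); Greek letters are spelled out.
S = {
    "p1": (1, "a", "delta"),
    "p2": (1, "b", "gamma"),
    "p3": (2, "c", "alpha"),
    "p4": (2, "b", "alpha"),
    "p5": (3, "b", "gamma"),
    "p6": (4, "d", "beta"),
}


def is_3d_matching(points, n):
    """True if there are n points that differ on every dimension."""
    if len(points) != n:
        return False
    return all(len({p[dim] for p in points}) == n for dim in range(3))


def solve_3dm(s, n):
    """Exhaustive search over all n-subsets of s; returns names or None."""
    for names in combinations(sorted(s), n):
        if is_3d_matching([s[name] for name in names], n):
            return list(names)
    return None


solution = solve_3dm(S, 4)
```

On this instance the search returns the subset named in the text.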
Let v_1, ..., v_n be the values in D_1, v_{n+1}, ..., v_{2n} be
the values in D_2, and v_{2n+1}, ..., v_{3n} be the values in D_3.
We construct a microdata table T from S. Specifically, T has

• a sensitive attribute B;

• d QI attributes A_1, A_2, ..., A_d, where A_i (1 ≤ i ≤ d)
corresponds to the i-th point p_i in S;

• 3n rows, where the j-th (1 ≤ j ≤ 3n) row corresponds to v_j.

The rows in T are constructed as follows. Let t be the j-th
(1 ≤ j ≤ 3n) row of T. We first select a positive integer u
according to the value of j (details to be clarified shortly). Then,
we set the SA value of t to u, i.e., t[B] = u. After that, for each
i ∈ [1, d], we set t[A_i] to 0 if v_j is a coordinate of point
p_i ∈ S, or to u otherwise. Because each p_i has three coordinates,
the following property of T holds.

PROPERTY 1. For any i ∈ [1, d], there exist exactly 3 rows in T
that have value 0 on A_i.

The value of u is chosen in a way that ensures another two
properties of T. First, T should contain m (m ≤ 3n) distinct SA
values, namely, 1, 2, ..., m. Second, for any i, j ∈ [1, 3n], if v_i
and v_j belong to different domains (e.g., v_i ∈ D_1 and
v_j ∈ D_2), the i-th and j-th rows in T should have different SA
values.

Specifically, we set u = j for any j ∈ [1, m − 2]. When
j ∈ [m − 1, 3n], we differentiate three cases according to the
values of m and n:

• If m − 1 > 2n, we let u = m − 1 if j ∈ [m − 1, 3n − 1], and
u = m if j = 3n.

• If 2n ≥ m − 1 > n, then u = m − 1 if j ∈ [m − 1, 2n], and
u = m if j ∈ [2n + 1, 3n].

• If n ≥ m − 1, we set (i) u = m − 2 if j ∈ [m − 1, n],
(ii) u = m − 1 if j ∈ [n + 1, 2n], and (iii) u = m if
j ∈ [2n + 1, 3n].

Figure 1b demonstrates the T built from the S in Figure 1a, when
$m = 8$. For example, let t be the 7-th row (i.e., $j = 7$), which
corresponds to value $c \in D_2$. Since $j = 7$, $n = 4$, and $m = 8$, we
have $2n \ge m - 1 > n$ and $j \in [m - 1, 2n]$. Hence, $u = m - 1 = 7$.
$t[A_3]$ equals 0, because c is the second coordinate of $p_3 \in S$. t has
7 on other QI attributes because c is not the 2nd coordinate of any
other point in S.
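
The construction can be sketched in Python as follows. The six points below are hypothetical stand-ins for the set S (chosen only so that $\{p_1, p_3, p_5, p_6\}$ is a matching and $v_7$ is a coordinate of $p_3$ alone); values are identified by their indices 1..3n, and the names `sa_value` and `build_table` are illustrative rather than taken from the construction itself.

```python
def sa_value(j, n, m):
    """SA value u of the j-th row (1-based), following the three cases."""
    if j <= m - 2:
        return j
    if m - 1 > 2 * n:
        return m - 1 if j <= 3 * n - 1 else m
    if m - 1 > n:  # 2n >= m - 1 > n
        return m - 1 if j <= 2 * n else m
    # n >= m - 1
    if j <= n:
        return m - 2
    return m - 1 if j <= 2 * n else m


def build_table(points, n, m):
    """Return the 3n rows of T as (QI values, SA value) pairs."""
    rows = []
    for j in range(1, 3 * n + 1):
        u = sa_value(j, n, m)
        # t[A_i] is 0 when v_j is a coordinate of p_i, and u otherwise.
        qi = [0 if j in p else u for p in points]
        rows.append((qi, u))
    return rows


# Hypothetical instance with n = 4: each point is a triple of value indices,
# one from D1 = {1..4}, one from D2 = {5..8}, and one from D3 = {9..12}.
S = [(1, 5, 9), (1, 6, 10), (2, 7, 10), (2, 5, 12), (3, 6, 11), (4, 8, 12)]
T = build_table(S, n=4, m=8)
```

On this instance the 7-th row has SA value 7, a 0 on $A_3$, and 7 on every other QI attribute, and each column contains exactly three zeros, as Property 1 requires.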

Let T* be any 3-diverse generalization of T. We say that a QI- 
group Q in T* is futile if all the QI values in Q are stars (i.e., Q 
retains no QI information at all). Otherwise, Q is useful. T* has
several properties.

PROPERTY 2. If a QI-group Q in T* is useful, then all non-star 
QI values in Q must be 0. 

PROOF. Consider any $i \in [1, d]$ such that Q has no star on $A_i$
after generalization. Then, before generalization, all tuples in Q
should have the same value on $A_i$. Let this value be x. By the way
T is constructed, if $x \ne 0$, all tuples in Q should have the SA value
x, which contradicts the assumption that Q is 3-eligible. Therefore,
$x = 0$ holds. □

PROPERTY 3. Any useful QI-group Q in T* contains (i) exactly
three tuples, (ii) $3(d - 1)$ stars, and (iii) 3 zeros.

PROOF. Let h be the number of tuples in Q. Since Q is useful,
there exists $i \in [1, d]$ such that all tuples in Q have value 0 on $A_i$
(see Property 2). By Property 1, there exist only three tuples in T
that have value 0 on $A_i$. Hence, $h \le 3$. On the other hand, because
Q is 3-eligible, $h \ge 3$. Therefore, Q contains exactly three tuples.

Consider that, before generalization, Q contains three rows $t_a$,
$t_b$, and $t_c$ ($a, b, c \in [1, 3n]$) in T. Assume on the contrary that
there exists $y \in [1, d]$, $y \ne i$, such that Q has no star on $A_y$. Recall
that the j-th ($j \in [1, 3n]$) row in T has value 0 on $A_i$ ($A_y$) if
and only if $v_j$ is a coordinate of $p_i$ ($p_y$). Thus, each of $v_a$, $v_b$, and
$v_c$ should appear in both $p_i$ and $p_y$. This indicates that $p_i = p_y$,
leading to a contradiction. Therefore, Q should have 0 on exactly
one QI attribute. Since Q contains three tuples, the number of stars
(zeros) in Q should be $3d - 3$ (3). □

PROPERTY 4. T* has at least $3n(d - 1)$ stars.

PROOF. Let us analyze the number of non-star QI values in T*.
Each non-star can come from only a useful QI-group. According
to Property 3, each such group contains 3 non-star QI values. As
T* has 3n rows and each useful QI-group has 3 rows, there can be
at most n useful QI-groups. Therefore, the number of non-star QI
values is at most 3n, and the property follows. □

LEMMA 3. The 3DM on S returns "yes" if and only if there is
a 3-diverse generalization of T with $3n(d - 1)$ stars.

PROOF. "Only-if direction": Without loss of generality, let
$S' = \{p_1, \ldots, p_n\}$ be the solution of the 3DM. Then, we create n useful
QI-groups $Q_1, \ldots, Q_n$, where $Q_i$ ($1 \le i \le n$) encloses the 3
tuples in T whose values on attribute $A_i$ are 0. By the way T is
constructed, each of the 3 tuples corresponds to a coordinate of $p_i$,
and has a distinct SA value. Hence, $Q_i$ is 3-eligible. Since the
points in S' do not share any coordinate, $Q_1, \ldots, Q_n$ are mutually
disjoint, and hence, their union covers the entire T (which has 3n
tuples in total). Generalizing each $Q_i$ ($1 \le i \le n$) introduces
$3(d - 1)$ stars, leading to $3n(d - 1)$ stars in total. The resulting
generalization is 3-diverse.

"If direction": Let T* be a 3-diverse generalization with
$3n(d - 1)$ stars. According to Property 3, T* has exactly n QI-groups, and
all of them are useful. Denote these groups as $Q_1, \ldots, Q_n$. Since
each group contributes $3(d - 1)$ stars, $Q_x$ ($1 \le x \le n$) has no star
on exactly one QI attribute $A_i$ ($i \in [1, d]$). Let us call $A_i$ the useful
QI attribute of $Q_x$. Assume that, before generalization, $Q_x$
contains three rows $t_a$, $t_b$, and $t_c$ ($a, b, c \in [1, 3n]$), where $t_j$ denotes
the j-th row in T. Since $t_a[A_i] = t_b[A_i] = t_c[A_i] = 0$, according
to the way T is generated, $v_a$, $v_b$, and $v_c$ should be the coordinates
of $p_i$. We define $p_i$ as the point in S corresponding to $Q_x$.
Observe that the useful QI attributes of any two QI-groups must be
different; otherwise there exist at least six tuples in T that have 0
on the same QI attribute, contradicting Property 1. Consequently,
each QI-group should correspond to a distinct point in S. Without
loss of generality, assume that $p_j$ is the point corresponding to $Q_j$
($1 \le j \le n$). Let $S' = \{p_1, \ldots, p_n\}$. Since each row in T
appears in exactly one QI-group, each coordinate in $D_1 \cup D_2 \cup D_3$
appears in exactly one point in S'. Therefore, S' is a solution to the
3DM. □
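
The star count in Lemma 3 can be checked mechanically on a small case. The sketch below suppresses, within each QI-group, every attribute on which the group's tuples disagree, and counts the resulting stars. The table is a hypothetical instance with n = 2 and d = 3, built by the construction above from the assumed points (1, 3, 5), (2, 4, 6), and (1, 4, 5) with m = 6; the helper name `count_stars` is illustrative.

```python
def count_stars(rows, groups):
    """rows: list of QI-value lists; groups: lists of row indices forming a partition."""
    stars = 0
    for group in groups:
        d = len(rows[group[0]])
        for a in range(d):
            # An attribute is starred for the whole group if its values differ.
            if len({rows[r][a] for r in group}) > 1:
                stars += len(group)
    return stars


# QI values of the six rows (row j corresponds to value v_j, j = 1..6).
T = [
    [0, 1, 0],
    [2, 0, 2],
    [0, 3, 3],
    [4, 0, 0],
    [0, 5, 0],
    [6, 0, 6],
]
n, d = 2, 3

# Groups induced by the matching {p1, p2}: rows with 0 on A1, rows with 0 on A2.
matching_groups = [[0, 2, 4], [1, 3, 5]]
stars = count_stars(T, matching_groups)
```

Here `stars` equals $3n(d - 1) = 12$, whereas putting all six rows into a single futile group would cost 18 stars.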

Property 4 and Lemma 3 imply that we can decide whether S
has a 3DM solution, by examining if an optimal 3-diverse
generalization of T has $3n(d - 1)$ stars. Therefore, if we had an optimal
polynomial-time l-diversity algorithm that works on all microdata
tables with $m \in [l, |T|]$, this algorithm would also solve 3DM in
polynomial time.

Extending the above analysis in a straightforward manner, we
can show that, for any $l > 3$, optimal l-diversity under the
constraint $m \ge l$ is also NP-hard, through a reduction from
l-dimensional matching [17]. Thus, we arrive at:

THEOREM 1. For any $m \ge l \ge 3$, optimal l-diverse
generalization (Problem 1) is NP-hard.

We conclude this section by pointing out that our proof requires
only an alphabet (i.e., the union of the domains of all attributes in
the microdata) of size $m + 1$. For example, in Figure 1b, $m = 8$ and
T has 9 different values $0, 1, \ldots, 8$.

5. TUPLE MINIMIZATION 

This section tackles tuple minimization (Problem 2), and
presents an algorithm with an approximation ratio of l. By
Lemma 2, it leads to an $(l \cdot d)$-approximation for star
minimization, resulting in the first l-diversity algorithm with a non-trivial
worst-case bound on information loss. Furthermore, this algorithm
leverages several novel heuristics that work fairly well in practice,
and usually produce a solution of much better quality than the
upper bound suggests.

5.1 Algorithm Overview 

Since our goal is to minimize the number of tuples suppressed,
we can redefine the problem as follows. Suppose that the
microdata T is partitioned into s QI-groups $Q_1, \ldots, Q_s$, where
tuples in the same QI-group have the same value on every QI
attribute. The problem is to remove the minimum number of tuples
from $Q_1, \ldots, Q_s$, such that: (a) all QI-groups are l-eligible, and
(b) the set of all removed tuples is l-eligible. We denote the set of
removed tuples by R, and these tuples will correspond to the
suppressed tuples. We refer to R as the residue set. Since switching
tuples with the same QI and SA values will not change the quality
of the solution, in the following we will not distinguish such tuples.
In this manner, the QI-groups and R are effectively considered as
multisets.

Initially R is empty. Throughout the algorithm, tuples are only 
moved to R but never taken back. We follow different rules to 
pick tuples to remove in the three phases of the algorithm. In the 
first phase, we will make sure that condition (a) above is satisfied. 
If condition (b) is also met, the algorithm immediately terminates. 
Otherwise in phase two, we try to do an "easy fix" of the problem 
by removing some more tuples from the Ql-groups without violat- 
ing condition (a). If at any time during phase two condition (b) is 
met, the algorithm ends, or else we proceed to phase three. In the 
last phase, we do an "overhaul" in order to satisfy condition (b) by 
removing tuples in large batches. Before giving the details for each 
phase below, we point out that approximations are introduced in 
succession: If the algorithm terminates during the first phase, then
the returned solution is guaranteed to be optimal; if the algorithm
ends in phase two, only an additive error of $l - 1$ is introduced; only
in phase three may we encounter a multiplicative error of l.

Sections 5.2-5.4 describe the conceptual procedures of the three
phases respectively, and analyze their theoretical guarantees. We
defer the running time discussion to Section 5.5.

5.2 Phase One 

For a QI-group Q and an SA value v, denote the number of tuples
in Q with SA value v by h(Q, v). We call the SA value with the
most tuples the pillar SA value, or simply the pillar. The number
of tuples in the pillar is called the pillar height of Q, denoted by
$h(Q) = \max_v h(Q, v)$. Note that there could be more than one
pillar in a QI-group. These terms are similarly defined on the set of
removed tuples R.

Algorithm. The rule of phase one is simple: for each QI-group,
repeatedly remove one tuple from its pillar until the QI-group is
l-eligible. If there is more than one pillar, the choice can be arbitrary.
Note that although we break ties arbitrarily, the end result is unique,
for the following reason. When there is more than one pillar
in a QI-group that is still not l-eligible, removing a tuple from
any of the pillars will not decrease the pillar height, and hence the
QI-group will not become l-eligible after the removal. Only after
all pillars have lost one tuple does the QI-group have a chance of
becoming l-eligible. In other words, no matter what order is taken,
we will eventually remove one tuple from each pillar.
At the end of phase one, we check if R is l-eligible, i.e.,

$|R| \ge l \cdot h(R)$. (1)

If (1) holds, then the algorithm terminates; otherwise we proceed
to phase two.
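
A minimal sketch of phase one, assuming each QI-group is represented as a `collections.Counter` that maps an SA value to the number of tuples carrying it (the representation and the function names are illustrative choices). The three groups used below are those of the worked example in Section 5.3.

```python
from collections import Counter


def is_eligible(counts, l):
    """l-eligibility test: size >= l * pillar height (an empty multiset passes)."""
    size = sum(counts.values())
    height = max(counts.values(), default=0)
    return size >= l * height


def phase_one(groups, l):
    """Trim every QI-group in place until it is l-eligible; return the residue set R."""
    residue = Counter()
    for q in groups:
        while not is_eligible(q, l):
            pillar = max(q, key=q.get)  # any pillar will do; ties are arbitrary
            q[pillar] -= 1
            if q[pillar] == 0:
                del q[pillar]
            residue[pillar] += 1
    return residue


def vec(*counts):
    """Vector notation: vec(3, 1) has three tuples of SA value 1 and one of value 2."""
    return Counter({v: c for v, c in enumerate(counts, 1) if c})


groups = [vec(3, 1, 1, 2, 3), vec(0, 2, 2, 4, 4), vec(4, 4, 0, 0, 0)]
R = phase_one(groups, l=3)
```

With l = 3 the first two groups are untouched, the third is emptied, and R = (4, 4, 0, 0, 0), which is not yet 3-eligible because |R| = 8 < 3 · 4.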

Consider the example in Table 1 with $l = 2$. Initially we have 5
QI-groups: {1, 2}, {3}, {4}, {5, 6, 7, 8}, {9, 10}. After phase one,
the first three QI-groups are completely eliminated, and the other
two QI-groups remain unchanged. The set R of removed tuples
has the following (multi)set of SA values: {HIV, HIV, pneumonia,
bronchitis}. In this case R is already l-eligible and thus the whole
algorithm terminates. However, if we are not so lucky, we need to
go to phase two.

Analysis. Let $Q_1, \ldots, Q_s$ be the QI-groups at the end of phase
one, and R the set of removed tuples. If (1) holds after phase one,
then $(Q_1, \ldots, Q_s, R)$ must be an optimal solution. In fact, we can
prove a stronger result.

LEMMA 4. For any $1 \le i \le s$ and any l-eligible subset $Q'_i$ of
the original i-th QI-group, $h(Q'_i, v) \le h(Q_i, v)$ for all SA values v.

The proofs for the rest of the paper can be found in the appendix.
Based on Lemma 4, we prove the following corollary.

COROLLARY 1. If the algorithm terminates after phase one,
then $(Q_1, \ldots, Q_s, R)$ is an optimal solution.

Let OPT be the number of tuples in R in the optimal solution.
The following lower bound on OPT is another easy corollary of
Lemma 4 that will be useful later on.

COROLLARY 2. $OPT \ge l \cdot h(R)$.

5.3 Phase Two 

In phase two, we try to increase |R| while keeping h(R)
unchanged by removing tuples from the QI-groups while maintaining
their l-eligibility. We continue the process until Inequality (1) is
satisfied, or no more tuples can be removed.

Before describing the phase two algorithm we need some more
terminology. We know that $|Q| \ge l \cdot h(Q)$ for any QI-group Q at
the end of phase one. We say that Q is thin if $|Q| = l \cdot h(Q)$, and fat
if $|Q| \ge l \cdot h(Q) + 1$. If Q has one or more pillars that are also
pillars of R, then Q is a conflicting QI-group; these pillars are the
conflicting pillars of Q. If Q is both thin and conflicting, it is said
to be dead; otherwise it is alive. Intuitively, a dead QI-group cannot
lose any more tuples without either increasing h(R) or violating its
own l-eligibility. An SA value v is alive if there exists at least one
alive QI-group Q such that $h(Q, v) > 0$.
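
The terminology translates directly into a few predicates. The sketch below again assumes a `Counter`-based multiset representation (an illustrative choice) and classifies a QI-group relative to the current residue set.

```python
from collections import Counter


def pillars(counts):
    """Set of SA values attaining the pillar height (empty for an empty multiset)."""
    h = max(counts.values(), default=0)
    return {v for v, k in counts.items() if k == h}


def classify(q, residue, l):
    """Report whether an l-eligible QI-group q is thin, fat, conflicting, or dead."""
    size = sum(q.values())
    h = max(q.values(), default=0)
    thin = size == l * h
    conflicting = bool(pillars(q) & pillars(residue))
    return {
        "thin": thin,
        "fat": size >= l * h + 1,
        "conflicting": conflicting,
        "dead": thin and conflicting,
    }


R = Counter({1: 4, 2: 4})
status = classify(Counter({1: 3, 2: 1, 4: 2, 5: 3}), R, l=3)
```

In this call the group has 9 tuples and pillar height 3, so it is thin; it shares pillar 1 with R, so it is also conflicting and therefore dead.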

Algorithm. Phase two proceeds iteratively as follows. In each
iteration, we pick an alive SA value v such that h(R, v) is
minimized, i.e., v is the least frequent alive SA value in R. When there
are multiple such SA values, we pick one arbitrarily. If there is
no alive SA value, then phase two cannot solve the problem and
we enter phase three. Otherwise, we go to an alive QI-group Q where
$h(Q, v) > 0$; again the choice is arbitrary if there is more than
one option. There are two cases: If Q is fat, then we simply
remove a tuple from Q with SA value v, decrementing h(Q, v) while
incrementing h(R, v). If Q is thin, then, being alive, it must be
non-conflicting, so we remove a tuple from each of Q's pillars.
Note that this may or may not increase h(R, v). This iteration now
ends. If at this time R becomes l-eligible, the whole algorithm
terminates; otherwise a new iteration starts.

Consider the following example with $m = 5$ SA values,
$s = 3$ QI-groups, $l = 3$, and $Q_1 = (3, 1, 1, 2, 3)$, $Q_2 =
(0, 2, 2, 4, 4)$, $Q_3 = (4, 4, 0, 0, 0)$ before phase one. (For
notational simplicity we use the vector representation for multisets. For
instance, (3, 1, 1, 2, 3) means there are three tuples with SA value
1, one tuple each with SA values 2 and 3, and two and three
tuples with SA values 4 and 5, respectively.) In phase one, $Q_1$
and $Q_2$ do not change, while all tuples of $Q_3$ are removed. Thus
at the end of phase one the status is $Q_1 = (3, 1, 1, 2, 3)$, $Q_2 =
(0, 2, 2, 4, 4)$, $R = (4, 4, 0, 0, 0)$. Now phase two starts. In the
first iteration, there are five alive SA values: 1, 2, 3, 4, and 5.
Suppose we pick $v = 3$. Note that both $Q_1$ and $Q_2$ can give
a 3 to R, and the choice can be arbitrary. Say we remove a
3 from $Q_1$, changing the status to $Q_1 = (3, 1, 0, 2, 3)$, $Q_2 =
(0, 2, 2, 4, 4)$, $R = (4, 4, 1, 0, 0)$. Now $Q_1$ is dead, since it is
both thin and conflicting (the conflicting pillar is 1). In the
second iteration, 3, 4, 5 are still alive SA values, but since 4
and 5 have the minimum h(R, v), we pick one of them arbitrarily,
say 4. $Q_2$ now is the only alive QI-group: it is thin but
non-conflicting. So we remove a 4 and a 5 together, changing the status
to $Q_1 = (3, 1, 0, 2, 3)$, $Q_2 = (0, 2, 2, 3, 3)$, $R = (4, 4, 1, 1, 1)$. In
the third iteration, 3, 4, 5 are all possible choices, and $Q_2$ is fat.
Say we remove a 3, which results in $Q_1 = (3, 1, 0, 2, 3)$, $Q_2 =
(0, 2, 1, 3, 3)$, $R = (4, 4, 2, 1, 1)$. At this point R has become
l-eligible, and the algorithm terminates.
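
The iteration can be sketched as follows, under the same assumed `Counter` representation. This is a conceptual version that rescans all groups in every iteration, not the efficient implementation of Section 5.5, and its tie-breaking rule (smallest SA value, first group in list order) is one arbitrary choice among many.

```python
from collections import Counter


def size(c):
    return sum(c.values())


def height(c):
    return max(c.values(), default=0)


def pillars(c):
    h = height(c)
    return {v for v, k in c.items() if k == h}


def move(q, residue, v):
    """Move one tuple with SA value v from QI-group q to the residue set."""
    q[v] -= 1
    if q[v] == 0:
        del q[v]
    residue[v] += 1


def phase_two(groups, residue, l):
    """Return True if R becomes l-eligible, False if no alive SA value remains."""
    def alive(q):
        thin = size(q) == l * height(q)
        return not (thin and pillars(q) & pillars(residue))

    while size(residue) < l * height(residue):
        # (h(R, v), v, Q) for every SA value v occurring in an alive QI-group Q.
        candidates = [(residue[v], v, q) for q in groups if alive(q) for v in q]
        if not candidates:
            return False  # every QI-group is dead: phase three is needed
        _, v, q = min(candidates, key=lambda c: c[:2])
        if size(q) > l * height(q):   # fat: give up a single tuple of value v
            move(q, residue, v)
        else:                         # thin and non-conflicting
            for p in pillars(q):
                move(q, residue, p)
    return True


def vec(*counts):
    return Counter({v: c for v, c in enumerate(counts, 1) if c})


Q1, Q2, Q3 = vec(3, 1, 1, 2, 3), vec(0, 2, 2, 4, 4), vec()
R = vec(4, 4)
done = phase_two([Q1, Q2, Q3], R, l=3)
```

With this tie-breaking the run follows the trace of the example: it ends with R = (4, 4, 2, 1, 1), and h(R) stays at 4 throughout.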

Analysis. Let $Q_1, \ldots, Q_s$, R be the status at the end of phase two.
We first prove that h(R) does not increase in this phase, that is:

LEMMA 5. The value of h(R) at the end of phase two equals its
value at the end of phase one.

In general, R may not be l-eligible. However, if it is, then the
following guarantee holds.

LEMMA 6. If the algorithm terminates during phase two, then
$|R| \le l \cdot h(R) + l - 1$.

By combining Corollary 2 and Lemma 6, we have the following
result.

COROLLARY 3. If the algorithm terminates during phase two,
then it returns a solution such that $|R| \le OPT + l - 1$.

It is clear that all QI-groups are dead after phase two (unless the
algorithm terminates). In this case, the following property holds,
which will be useful later on.

LEMMA 7. If $Q_1, \ldots, Q_s$ are all dead and R is not l-eligible,
then for any pillar p of R, there exists some $Q_i$ such that p is not a
conflicting pillar of $Q_i$.

The following corollary follows from Lemma 7.

COROLLARY 4. If the algorithm does not terminate after phase
two, then R has at least two pillars.

The result above implies that for the special case $l = 2$, the
algorithm always terminates during the first two phases.

THEOREM 2. For $l = 2$, our algorithm always solves the tuple
minimization problem during the first two phases with a solution
$|R| \le OPT + 1$.

5.4 Phase Three 

In most cases the algorithm will stop in the first two phases.
However, on some "hard" inputs we will have to resort to phase
three. In the output from phase two, all QI-groups are thin and
conflicting, and still $|R| < l \cdot h(R)$. The failure of phase two suggests
that in order to satisfy Inequality (1), we cannot just increase |R|.
We need to increase both |R| and h(R), but in a careful way such
that the amount of increase in |R| is more than l times the increase
in h(R), so that eventually the gap between |R| and $l \cdot h(R)$ can be
closed.

Algorithm. The third phase proceeds in rounds, each consisting of
two steps. In the first step, we pick a subset S of QI-groups, and
remove one tuple from each of their pillars. This increases h(R) but
also (possibly) makes these QI-groups fat. Meanwhile, since
certain pillars of R might have disappeared after this step, some other
QI-groups may switch from conflicting to non-conflicting. More
precisely, a greedy algorithm is used to decide S. Initially we set
P to be the set of pillars of R. For a QI-group Q, let C(Q) be
the set of conflicting pillars of Q. The greedy algorithm iteratively
does the following. As long as P is not empty, pick the QI-group
Q that minimizes $|C(Q) \cap P|$, and then set $P \leftarrow P \cap C(Q)$. Note
that here the problem is equivalent to SET COVER [11], i.e., we
are using the complements $\overline{C(Q)}$ as the "sets" to cover all the
pillars of R, and this greedy algorithm is actually the same as the
standard heuristic for SET COVER.

In the second step, for each QI-group Q, if it has become alive
after step one, then we keep removing tuples from Q until it
becomes dead again, using the following simple rules: If Q is fat,
remove a tuple from any SA value that is not a pillar of R. If Q is
thin, then check if it is conflicting. If yes, then we are done with
this QI-group; otherwise we remove a tuple from each of its pillars.
If at any time R becomes l-eligible, the whole algorithm
terminates. Note that if the algorithm does not terminate after a round,
all QI-groups have become dead again.

The following is an example showing how phase three works.
Suppose $m = 5$, $s = 2$, $l = 4$, and the status after phase two
is $Q_1 = (3, 1, 2, 3, 3)$, $Q_2 = (1, 3, 2, 3, 3)$, $R = (4, 4, 4, 0, 0)$.
Note that $Q_1$ and $Q_2$ are both thin and conflicting: $Q_1$ conflicts
on 1 while $Q_2$ conflicts on 2. In step one of the first round, we
pick QI-groups whose complements $\overline{C(Q)}$ together cover the pillars
{1, 2, 3} of R. As $\overline{C(Q_1)} = \{2, 3, 4, 5\}$ and
$\overline{C(Q_2)} = \{1, 3, 4, 5\}$, the
greedy algorithm chooses both $Q_1$ and $Q_2$. Then we remove one
tuple from each of the pillars of $Q_1$ and $Q_2$, resulting in the
following configuration: $Q_1 = (2, 1, 2, 2, 2)$, $Q_2 = (1, 2, 2, 2, 2)$, $R =
(5, 5, 4, 2, 2)$. In step two, we first remove tuples from $Q_1$ until it
becomes dead. As $Q_1$ is fat and SA values 3, 4, 5 are not pillars
of R, we can remove any tuple of those SA values from $Q_1$.
Suppose a 3 is removed, resulting in $Q_1 = (2, 1, 1, 2, 2)$, $Q_2 =
(1, 2, 2, 2, 2)$, $R = (5, 5, 5, 2, 2)$. Similarly we remove a 4 from
$Q_2$, leading to $Q_1 = (2, 1, 1, 2, 2)$, $Q_2 = (1, 2, 2, 1, 2)$, $R =
(5, 5, 5, 3, 2)$. At this point, R is l-eligible and the algorithm
terminates. In this simple example, there is only one round, but in
general there could be multiple rounds.
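
Step one of a round can be sketched as below, with the same assumed `Counter` representation. The function names are illustrative, and the early exit when a pick makes no progress is a defensive assumption of this sketch (Lemma 7 indicates it should not trigger when all groups are dead and R is not l-eligible).

```python
from collections import Counter


def pillars(c):
    h = max(c.values(), default=0)
    return {v for v, k in c.items() if k == h}


def choose_groups(groups, residue):
    """Greedy cover: return indices of the chosen QI-groups and any uncovered pillars."""
    uncovered = pillars(residue)  # the set P
    chosen, remaining = [], list(range(len(groups)))
    while uncovered and remaining:
        # Pick the group with the fewest conflicting pillars among those uncovered.
        i = min(remaining, key=lambda k: len(pillars(groups[k]) & uncovered))
        still = pillars(groups[i]) & uncovered  # C(Q) intersected with P
        if still == uncovered:
            break  # defensive guard (assumption): this pick covers nothing new
        chosen.append(i)
        remaining.remove(i)
        uncovered = still
    return chosen, uncovered


def remove_pillars(q, residue):
    """Move one tuple from each pillar of q to the residue set."""
    for v in pillars(q):
        q[v] -= 1
        if q[v] == 0:
            del q[v]
        residue[v] += 1


def vec(*counts):
    return Counter({v: c for v, c in enumerate(counts, 1) if c})


Q = [vec(3, 1, 2, 3, 3), vec(1, 3, 2, 3, 3)]
R = vec(4, 4, 4)
chosen, uncovered = choose_groups(Q, R)
for i in chosen:
    remove_pillars(Q[i], R)
```

On the example both groups are chosen, and after the removals the configuration is $Q_1 = (2, 1, 2, 2, 2)$, $Q_2 = (1, 2, 2, 2, 2)$, $R = (5, 5, 4, 2, 2)$, matching the state at the start of step two.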

Analysis. We now analyze the approximation ratio guaranteed by
the algorithm. As it turns out, the key factor is to bound the increase
in h(R) throughout phase three.

LEMMA 8. In each round of phase three, h(R) increases by at
most $l - 2$.

LEMMA 9. There are at most h(R) rounds in phase three.

Based on Lemmas 8 and 9, we prove our main theorem as
follows.

THEOREM 3. Our algorithm finds an l-approximate solution to
the tuple minimization problem.

5.5 Implementation 

Our three-phase algorithm can be implemented efficiently using
inverted list structures. In this subsection, we present an
implementation, which has a worst-case time complexity of $O(s \cdot n)$.

The basic data structure. We maintain an array $A_i$ for each
QI-group $Q_i$ throughout the algorithm, as well as an $A_R$ for the set of
removed tuples R. Suppose that $Q_i$ has $n_i$ tuples, for $i = 1, \ldots, s$.
The array $A_i$ has $n_i$ entries. The j-th entry, $A_i[j]$, contains a
pointer to a list of SA values v such that $h(Q_i, v) = j$. Note that
some entries of $A_i$ may be empty. Along with each SA value v, we
keep a pointer to a list of tuples in $Q_i$ with this SA value, called
the SA set of v. For each $A_i$, we also maintain $p_i$, the maximum
index j such that $A_i[j]$ is nonempty. In other words, $A_i[p_i]$ always
points to the list of pillars of $Q_i$. We similarly maintain the pillar
pointer $p_R$ for $A_R$. The whole data structure uses linear space and
can also be easily initialized in O(n) time.

This data structure supports an update, i.e., moving a tuple from
some $Q_i$ to R, in constant time. To move a tuple t, we first remove
it from its SA set, stored at some $A_i[j]$. If $j = 1$, we also delete
the SA set; otherwise, we move the SA set from $A_i[j]$ to $A_i[j - 1]$.
Next, we insert t to $A_R$, and the procedure is symmetric. Finally,
we also update $p_i$ and $p_R$. Note that although $p_i$ may decrease a
lot in a single update, the amortized cost of maintaining $p_i$ is O(1),
since $p_i$ only moves in one direction, and the total distance it travels
is at most $n_i$.
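
A simplified sketch of this inverted-list structure is given below. It is an assumption-laden illustration: it tracks only the count of each SA value rather than the tuple lists, uses hash-based buckets in place of a fixed array, and the class name `Buckets` is invented for the example.

```python
from collections import Counter, defaultdict


class Buckets:
    """buckets[j] holds the SA values with exactly j tuples; pillar is the top index."""

    def __init__(self, sa_values=()):
        self.count = Counter(sa_values)
        self.buckets = defaultdict(set)
        for v, c in self.count.items():
            self.buckets[c].add(v)
        self.pillar = max(self.buckets, default=0)

    def remove(self, v):
        """Take one tuple with SA value v out of the multiset."""
        j = self.count[v]
        self.buckets[j].discard(v)
        if j > 1:
            self.buckets[j - 1].add(v)
            self.count[v] = j - 1
        else:
            del self.count[v]
        # The pillar pointer only moves down here, so its total cost is amortized.
        while self.pillar > 0 and not self.buckets[self.pillar]:
            self.pillar -= 1

    def add(self, v):
        """Insert one tuple with SA value v (the symmetric operation, used for R)."""
        j = self.count[v]
        if j:
            self.buckets[j].discard(v)
        self.count[v] = j + 1
        self.buckets[j + 1].add(v)
        self.pillar = max(self.pillar, j + 1)


q = Buckets([1, 1, 1, 2, 4, 4])
r = Buckets()
q.remove(1)
r.add(1)
```

After the move, the pillar height of `q` drops from 3 to 2, with SA values 1 and 4 sharing the top bucket, while `r` has pillar height 1.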

Phase one. Consider the QI-group $Q_i$ with $n_i$ tuples. In phase
one, we simply keep removing tuples from the pillar of $Q_i$, i.e., the
first SA set in the list pointed to by $A_i[p_i]$. Since the update cost for
each removed tuple is O(1), and we can also easily check if $Q_i$ is
l-eligible after each update, the running time for this QI-group is
$O(n_i)$, implying a total running time of O(n) for phase one.

Phase two. To efficiently implement our phase two algorithm,
another inverted list C, called the candidate list, is required. It is an
array of size n. At C[j] we store a list of entries of the form (i, v),
one for each alive SA value v in $Q_i$ if $h(R, v) = j$. That is, the list
at C[j] stores (the pointers to) all possible SA sets from which we
can remove tuples. C can be initialized in O(n) time. It can also be
maintained with O(1) cost after a tuple is inserted to R.

In each iteration of phase two, we pick a pair (i, v) from the list
stored at the first non-empty entry C[j]. Next we check if $Q_i$ is
fat. If it is, we simply remove a tuple from $Q_i$ with SA value v;
otherwise we remove a tuple from each of $Q_i$'s pillars. At the end
of the iteration, we check if $Q_i$ is dead. If so, we remove all its
entries (i, v) from C. Since the cost to remove a tuple is O(1), and
there are at most n entries in C, the total cost of phase two is O(n).

Phase three. The first step of each round in phase three is the
standard greedy algorithm for SET COVER, which can be implemented
in $O(s \cdot l)$ time [11], since there are s sets and each set has
cardinality at most l. In the second step, by using the inverted list $A_i$, a
QI-group $Q_i$ can be handled in time $O(1 + r)$, where r is the
number of tuples removed from $Q_i$. To see this, note that every time we
apply the rule, we either remove tuples, whose cost can be charged
to the O(r) term, or declare that $Q_i$ is dead, whose cost is at most
O(1). Since we remove at most n tuples in total, the overall cost
of phase three is thus $O(s \cdot l \cdot h(R) + n)$, as there are at most h(R)
rounds by Lemma 9. Finally, since $h(R) \le n/l$, we conclude that
the total cost of phase three is $O(s \cdot l \cdot n/l + n) = O(s \cdot n)$. Note that
this is a very pessimistic bound, as the typical number of rounds
is much smaller than $n/l$ in practice.

THEOREM 4. Our three-phase algorithm can be implemented
in $O(s \cdot n)$ time.

5.6 Discussions 

The performance of our algorithm is sensitive to the diversity of
QI values in the microdata. If most tuples in the microdata have
distinct QI values, the first phase of our algorithm would start with
a large number of QI-groups that contain fewer than l tuples;
eventually, all tuples in these QI-groups will be moved to the set R and
suppressed, leading to a significant number of stars in the
generalized data. Such degradation of data utility usually occurs when
the microdata contains QI attributes with large domains. For
example, a microdata table with Birth Date, Gender, and ZIP Code as
the QI attributes would contain a significant number of tuples that
have distinct QI values, since both Birth Date and ZIP Code have
sizable domains, and hence, any two tuples are likely to differ on
either attribute.²

2 Indeed, a recent study [43] has shown that 87% of the U.S. popu- 
lation can be uniquely identified by their birth dates, genders, and 
5-digit ZIP codes. 



Despite the above drawback, our algorithm can still be useful in
some scenarios, for the following reasons. First, our algorithm
can be applied on datasets with small or medium-sized QI domains. Such
microdata exists, as many QI attributes in practice, such as Gender,
Race, Marital Status, Years of School Attendance, have domains
with cardinalities below 20.

Second, QI attributes with large domains often need to be coars- 
ened (even before generalization is performed) to avoid disclosure 
of excessively detailed personal information. For example, the 
Standards for Privacy of Individually Identifiable Health Informa- 
tion [12] (issued by the U.S. Department of Health and Human Ser- 
vices) requires that, unless otherwise justified, any personal data to 
be published should satisfy the following two conditions (in addi- 
tion to numerous other requirements): 

1. given any date directly related to an individual (e.g., birth 
date, admission date, discharge date), only the year of the 
date is released; 

2. only the first three digits of any ZIP code are retained. 

Therefore, given a dataset with QI attributes Birth Date and ZIP 
Code, if the publisher is to release the data in a manner that con- 
forms to the above standard, s/he should transform Birth Date to 
Year of Birth, and remove all but the initial three digits of any ZIP 
code. This considerably reduces the domain size of the attributes, 
making our algorithm applicable on the dataset. 
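
As a small illustration of the two coarsening rules, the following sketch maps a record's birth date and ZIP code to the coarser values that would be released. The function name and the input format (a `datetime.date` and a 5-digit ZIP string) are assumptions made for the example.

```python
from datetime import date


def coarsen(birth_date, zip_code):
    """Keep only the year of the date and the first three digits of the ZIP code."""
    return birth_date.year, zip_code[:3]


# Two hypothetical records that differ on the raw attributes.
a = coarsen(date(1975, 3, 14), "15213")
b = coarsen(date(1975, 11, 2), "15217")
```

Both records map to (1975, "152"), so after coarsening they fall into the same QI-group, whereas their raw QI values were distinct.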

Third, our algorithm can be easily combined with any heuristic
suppression algorithm to improve its performance over datasets
with diverse QI values. Specifically, given a microdata table, we
can first employ our algorithm to obtain (i) a set of QI-groups that
contain no stars, and (ii) the residue set R. After that, we can
apply any existing heuristic algorithm on R to divide it into smaller
QI-groups, thus reducing the number of values that need to be
suppressed. Clearly, such a hybrid approach performs at least as well as
our algorithm in star minimization, and hence, it also achieves an
approximation ratio of $O(l \cdot d)$.

Last but not least, given a microdata table, we may preprocess
it with any single-dimensional generalization method to
reduce the cardinalities of the QI domains, and then apply our
algorithm on the modified dataset. The preprocessing method
does not need to ensure l-diversity: even the k-anonymity
algorithms [7, 15, 20, 26, 44] can be applied. The amount of
generalization imposed in the preprocessing step has an effect on the quality
of the l-diverse table output by our algorithm. In particular, less
generalization leads to larger domains of the QI attributes, which, in
turn, results in more stars in the l-diverse table. On the other hand,
when the QI attributes are coarsened to a higher degree during
preprocessing, each non-star QI value in the l-diverse table corresponds
to a larger sub-domain of the QI attribute, i.e., the published QI
values are less accurate. To achieve a good tradeoff between the
number of stars and the accuracy of non-star QI values, we may
vary the amount of generalization in the preprocessing step,
examine the output of our algorithm, and choose the setting that
optimizes the utility of the l-diverse table. A complete treatment of this
issue, however, is beyond the scope of this paper.

6. EXPERIMENTS 

This section experimentally evaluates the proposed techniques.
Section 6.1 examines the performance of our algorithms in star
minimization, and Section 6.2 compares our algorithms with
single-dimensional generalization methods. All of our experiments
are performed on a computer with a 3 GHz Pentium IV CPU and 2
GB RAM.

6.1 Star Minimization

Algorithms evaluated. The existing l-diversity techniques employ
either single- or multi-dimensional generalization. We examine the
state of the art [15, 16, 27] of these techniques, modify them as
suppression algorithms, and choose Hilbert [16], the one that achieves
the best performance in star minimization, as the baseline with
which our algorithms are compared. We denote the three-phase
algorithm in Section 5 as TP. We have also implemented a hybrid
algorithm, TP+, which combines both Hilbert and TP. Specifically,
given a microdata T, TP+ first invokes TP to produce a partition of
T, and then applies Hilbert on the residue set R (produced by TP)
to reduce the number of stars in the l-diverse table. As discussed in
Section 5.6, such a hybrid algorithm also returns an $O(l \cdot d)$ solution
for the star minimization problem.

Datasets. Following [16, 47], we experiment with two datasets,
SAL and OCC, obtained from the American Community Survey
[37]. Both SAL and OCC contain 600k tuples, each capturing the
information about a U.S. adult. Specifically, SAL has a sensitive
attribute Income, and 7 QI attributes Age, Gender, Race, Marital
Status, Birth Place, Education, Work Class. OCC contains the same
QI attributes as in SAL, but has a different sensitive attribute
Occupation. Table 6 illustrates the domain size of each attribute.

Based on SAL, we generate 7 sets of microdata, SAL-1, SAL-2,
..., SAL-7. Each table in SAL-d ($1 \le d \le 7$) is a projection of
SAL on Income and d QI attributes. As SAL has 7 quasi-identifiers,
there are $\binom{7}{d}$ microdata tables in SAL-d in total. Similarly, we also
construct 7 sets of microdata OCC-d ($1 \le d \le 7$) from OCC.

Quality of generalizations. In the first set of experiments, we
investigate the effect of l on the quality of the generalization
produced by each technique. In particular, for any given l, we employ
each algorithm to generate l-diverse versions of the microdata in
SAL-4 (OCC-4). Then, the performance of an algorithm is gauged
by the average number of stars in the l-diverse generalizations it
generates for the $\binom{7}{4} = 35$ microdata tables in SAL-4 (OCC-4).

Figure 2 illustrates the average number of stars as a function of
l. All algorithms perform better when l decreases, since a smaller l
leads to a lower degree of privacy protection, which can be achieved
with less generalization. Both TP and TP+ consistently outperform
Hilbert. In addition, TP+ incurs a smaller number of stars than TP
in all cases.

Next, we examine the performance of each algorithm, fixing
$l = 6$ and varying the number d of QI attributes in the
microdata. Figure 3 shows the average number of stars incurred by each
technique, for the tables in SAL-d and OCC-d ($1 \le d \le 7$). The
average number of stars increases with d, which is consistent with
the analysis in [2] that all generalization techniques suffer from
the curse of dimensionality. On SAL-d (OCC-d), TP outperforms
Hilbert when d < 4 (d < 6), but is inferior to Hilbert given a
larger d. This is due to the fact that, as d increases, the tuples in
the microdata tend to have more diverse QI values, which, as
discussed in Section 5.6, renders TP less effective. TP+ overcomes
this drawback by incorporating Hilbert to refine the residue set R,
and hence, achieves better data utility than both TP and Hilbert.

Frequency of phase three execution. Recall that TP consists of
three phases. For any positive integer l and any microdata T
with d QI attributes, if TP terminates during the first or second
phase, the number of stars in the returned generalization is at most
d · (OPT + l − 1), where OPT is the minimum number of stars in
any l-diverse generalization of T. In contrast, if TP terminates after
phase three, the resulting generalization is an (l · d)-approximation.
Furthermore, the first two phases of TP have O(n) time complexity,
while the third phase runs in O(s · n) time in the worst case,
where s is the maximum number of tuples in T with distinct QI
values. Therefore, TP performs much better in terms of both
information loss and computation time when it returns generalized
tables without invoking phase three.

A natural question is: how often does TP execute the third phase?
To answer this question, we apply TP on each microdata table in
SAL-d and OCC-d (1 ≤ d ≤ 7) to compute its l-diverse
(2 ≤ l ≤ 10) generalization, and examine whether TP invokes the third
phase. It turns out that, on all 128 tables and for all 9 values of
l, TP terminates before the third phase. In other words, in all our
experiments, TP (and thus TP+) returns an O(d)-approximate solution
to the star minimization problem.

Computation overhead. In the following experiments, we compare
the efficiency of each algorithm. First, for any l ∈ [2, 10], we
examine the average time required by each technique to generate
l-diverse versions of the microdata in SAL-d (OCC-d). Figure 4
illustrates the computation time as a function of l. The overhead
of Hilbert decreases with the increase of l, which is also observed
in [16]. In contrast, the computation cost of TP and TP+ increases
with l. To understand this, recall that TP works by first dividing the
tuples into QI-groups, and then iteratively moving tuples from each
QI-group to the residue set R, until all QI-groups and R become
l-eligible. Given a larger l, TP has to remove more tuples from each
QI-group to achieve l-eligibility, resulting in higher computation
cost. In turn, this indicates that the residue set R becomes larger
when l increases. Consequently, the running time of TP+ also
increases with l, because TP+ post-processes the output of TP by
invoking Hilbert on R, the cost of which increases with the size of
R.

Figure 6: Computation time vs. n (l = 6)
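The tuple-moving step recalled above can be sketched in Python. This is a minimal illustration rather than the authors' implementation. It assumes that a set of tuples is l-eligible when its size is at least l times the count of its most frequent SA value (its pillar height), and that a QI-group is trimmed by capping every SA value at the largest height that keeps the group l-eligible, as the argument in the proof of Lemma 4 suggests. The names `is_l_eligible` and `trim_group` are hypothetical.

```python
from collections import Counter

def is_l_eligible(counts, l):
    """Assumed criterion: total size >= l * pillar height (max SA-value count)."""
    if not counts:
        return True
    return sum(counts.values()) >= l * max(counts.values())

def trim_group(sa_values, l):
    """Cap each SA value at the largest height that keeps the group l-eligible.

    Returns (kept, removed) as Counters over SA values; the removed tuples
    are the ones that would be moved to the residue set R.
    """
    counts = Counter(sa_values)
    tallest = max(counts.values(), default=0)
    # Try heights from the tallest pillar down; the first feasible height
    # removes the fewest tuples. Height 0 (empty group) is always feasible.
    for height in range(tallest, -1, -1):
        kept = Counter({v: min(c, height) for v, c in counts.items()
                        if min(c, height) > 0})
        if is_l_eligible(kept, l):
            return kept, counts - kept

# Usage: a group with SA counts flu=4, hiv=1, cold=1 and l = 2.
kept, removed = trim_group(["flu"] * 4 + ["hiv", "cold"], 2)
```

In this toy group, a larger l forces a lower cap and therefore more removed tuples, which matches the trend in running time discussed above.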

Next, we fix l = 6, and investigate the average computation time
of each algorithm on the microdata in SAL-d (OCC-d), varying d
from 1 to 7. Figure 5 illustrates the results. The computation cost
of TP increases with d. This is because, when d is large, TP has to
employ more generalization on the microdata to achieve l-diversity
(see Figure 3). As a result, TP needs to move a larger number of
tuples from the QI-groups to the set R, leading to higher processing
overhead. Because TP+ incorporates TP, its computation time also
increases with d. The efficiency of Hilbert is insensitive to d, which
is consistent with the experimental results in [16].

Finally, we study the effect of dataset cardinality n on the
computation time of generalization. For each table T in SAL-4 and
OCC-4, we generate various sample sets of T, with sample size
varying from 100k to 600k. After that, we employ each algorithm
to compute a 6-diverse generalization of each sample set, and
measure the average running time of the algorithm. Figure 6 plots the
computation overhead as a function of the dataset cardinality n.
The running time of each technique is less than 1.2 seconds even
for the largest datasets. The processing cost of TP increases
linearly with n. This is expected, since (i) TP bypasses phase three
in all cases, and (ii) the first and second phases of TP have linear
time complexity. The computation time of Hilbert is almost linear,
which confirms the analysis in [16] that Hilbert runs in O(n log n)
time. Since both TP and Hilbert scale well with n, TP+ (as a
combination of TP and Hilbert) also achieves satisfactory scalability.

Summary. In terms of data utility, TP+ significantly outperforms
not only TP but also Hilbert, the best existing algorithm that
can achieve l-diversity via suppression. In terms of computation
time, Hilbert is superior to TP and TP+. Nevertheless, as the
anonymization of microdata incurs only a one-time cost,
computational efficiency is not a major concern in data publishing.
This makes TP+ preferable to Hilbert for suppression-based
anonymization.



6.2 Comparison with Single-Dimensional Generalization

Having established TP+ as an excellent suppression-based algorithm,
in this section we move on to compare TP+ with the
single- and multi-dimensional generalization methods. First, we
observe that multi-dimensional generalization always guarantees
higher data utility than suppression. Specifically, given any table
T* generated by suppression, we may transform it into a
multi-dimensional generalization T*', by replacing each star on a QI
attribute A with a sub-domain of A, such that the sub-domain
contains all A values appearing in the QI-group. As each sub-domain
captures more accurate information than a star, T*' always incurs
less information loss than T*. For example, let us consider Table 3,
which contains four stars on Age and Education, respectively, and
all the stars appear in the first QI-group. We may replace each star
on Age with a sub-domain "<50", as it covers the Age values of all
tuples in the QI-group (see the microdata). Similarly, each star on
Education can be replaced with a sub-domain "Bachelor or above". This
results in the multi-dimensional generalization in Table 5, which
evidently contains more accurate information than Table 2.
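The star-replacement transformation described above can be sketched in Python. This is an illustrative sketch under stated assumptions, not the procedure used in the paper: tables are assumed to be lists of dictionaries, QI-groups are given as lists of row indices, and each sub-domain is represented simply as the set of original values that occur in the group (the paper's examples use coarser labels such as "<50"). The function name `stars_to_subdomains` and the `STAR` constant are hypothetical.

```python
STAR = "*"

def stars_to_subdomains(suppressed, microdata, groups, qi_attrs):
    """Replace each star with a sub-domain covering its QI-group.

    suppressed: list of dict rows, where some QI cells are STAR
    microdata:  list of dict rows with the original values (same order)
    groups:     list of QI-groups, each a list of row indices
    qi_attrs:   names of the QI attributes
    """
    result = [dict(row) for row in suppressed]  # leave the input untouched
    for group in groups:
        for attr in qi_attrs:
            # Smallest set containing every original value in this group.
            covering = frozenset(microdata[i][attr] for i in group)
            for i in group:
                if result[i][attr] == STAR:
                    result[i][attr] = covering
    return result

# Usage: one QI-group of two tuples whose Age values were suppressed.
micro = [{"Age": 25, "Gender": "M"}, {"Age": 40, "Gender": "M"}]
supp = [{"Age": STAR, "Gender": "M"}, {"Age": STAR, "Gender": "M"}]
general = stars_to_subdomains(supp, micro, [[0, 1]], ["Age", "Gender"])
```

Because every sub-domain is a subset of the full attribute domain that a star stands for, the transformed table can only narrow the possible values of each cell.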

As discussed in Section 2, however, multi-dimensional generalization
produces anonymized data that is unusable by off-the-shelf
statistical packages, whereas suppression does not suffer from this
drawback. Consequently, even though multi-dimensional generalization
outperforms suppression in terms of data utility, it cannot
be chosen over suppression in scenarios where software support
for anonymized data is a concern. Yet, in such scenarios,
suppression is not the only applicable anonymization method, as
single-dimensional generalization can also generate data that can
be directly fed into commercial statistical software. This leads to an
interesting question: how does TP+ compare to the existing
single-dimensional generalization methods in terms of data utility?

To answer the above question, we implement TDS³, the
state-of-the-art single-dimensional generalization algorithm proposed in
[15], and compare it against TP+ on the quality of generalization.
Following [16,23], we measure the quality of a generalized table
T* by the similarity between the multi-dimensional distribution
induced by T* and the distribution induced by the microdata T.
To explain this, observe that each tuple in T can be regarded as a
point in a (d + 1)-dimensional space Ω, where the i-th (1 ≤ i ≤ d)
dimension corresponds to the i-th QI attribute in T, and the
(d + 1)-th dimension corresponds to the sensitive attribute. As such, T
can be captured by a probability density function (pdf) f defined
on Ω, such that, for any point p ∈ Ω, f(p) equals the fraction of
tuples in T represented by p.

Similarly, any generalization T* of T defines a pdf f* on Ω.
In particular, if a tuple t* ∈ T* has a star on an attribute A, we
treat t*[A] as a random variable uniformly distributed in the domain
of A; on the other hand, if t*[A] is a sub-domain of A, we treat
t*[A] as uniformly distributed in the sub-domain. As in [16,23], we
gauge the similarity between f and f* by their KL-divergence [25]:

    KL(f, f*) = Σ_{p ∈ Ω} f(p) · log( f(p) / f*(p) ).



3 TDS was initially designed for k-anonymity. We modify it into an
l-diversity algorithm to facilitate the comparison with TP+.

A smaller KL(f, f*) indicates a higher degree of similarity
between f and f*.
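The quality measure described above can be sketched in Python for small categorical tables. This is a minimal illustration, not the evaluation code behind the reported figures. It assumes rows are dictionaries, a star is the string "*", a sub-domain is a Python set, attribute domains are supplied explicitly, and the natural logarithm is used. The function names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product

STAR = "*"

def cell_values(cell, domain):
    """Values a cell may take: the whole domain for a star, the set itself
    for a sub-domain, and the single value otherwise."""
    if isinstance(cell, (set, frozenset)):
        return list(cell)
    if cell == STAR:
        return list(domain)
    return [cell]

def pdf_of_microdata(rows, attrs):
    """f(p) = fraction of tuples in T located at point p."""
    f = defaultdict(float)
    for row in rows:
        f[tuple(row[a] for a in attrs)] += 1.0 / len(rows)
    return f

def pdf_of_generalization(rows, attrs, domains):
    """f*(p): each generalized tuple spreads its mass uniformly over the
    points it covers."""
    f_star = defaultdict(float)
    for row in rows:
        choices = [cell_values(row[a], domains[a]) for a in attrs]
        n_points = math.prod(len(c) for c in choices)
        for point in product(*choices):
            f_star[point] += 1.0 / (len(rows) * n_points)
    return f_star

def kl_divergence(f, f_star):
    """KL(f, f*) summed over the support of f. Assumes f*(p) > 0 wherever
    f(p) > 0, which holds when every generalized tuple covers its original."""
    return sum(p * math.log(p / f_star[pt]) for pt, p in f.items() if p > 0)

# Usage: two tuples whose Age values are both suppressed.
attrs = ["Age", "Disease"]
domains = {"Age": [20, 30], "Disease": ["flu", "hiv"]}
T = [{"Age": 20, "Disease": "flu"}, {"Age": 30, "Disease": "hiv"}]
T_star = [{"Age": STAR, "Disease": "flu"}, {"Age": STAR, "Disease": "hiv"}]
kl = kl_divergence(pdf_of_microdata(T, attrs),
                   pdf_of_generalization(T_star, attrs, domains))
```

Enumerating every covered point is only practical for small domains; it is used here to keep the correspondence with the definitions of f and f* direct.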

In the first set of experiments, we apply TP+ and TDS on the
microdata in SAL-4 and OCC-4, varying l from 2 to 10. Figure 7
plots the average KL-divergence incurred by each algorithm as a
function of l. TP+ significantly outperforms TDS in all cases. The
KL-divergence entailed by TP+ increases with l, which is consistent
with the results in Figure 2 that a larger l leads to more stars
in the generalized table.

Next, we fix l = 6, and measure the average KL-divergence
incurred by TP+ and TDS in anonymizing the microdata in SAL-d
(OCC-d). Figure 8 illustrates the average KL-divergence as a
function of d. Again, the information loss caused by TP+ is
consistently smaller than that of TDS. The performance of both
algorithms degrades with the increase of d, since, as mentioned in
Section 6.1, all generalization methods inevitably suffer from the
curse of dimensionality.

In summary, TP+ achieves significantly higher data utility than
TDS. This makes TP+ a favorable choice for data publishers who
aim to release generalized tables that can be easily analyzed using
existing statistical software. Multi-dimensional generalization
methods, on the other hand, should be adopted when the users are
equipped with their own tools for analyzing complex anonymized
data.

7. CONCLUSIONS 

The existing work on l-diversity focuses on the development
of heuristic solutions. In this paper, we present the first
theoretical study on the complexity and approximation algorithms of
l-diversity. First, we prove that computing the optimal l-diverse
generalization is NP-hard, for any l ≥ 3. After that, we develop an
O(l · d)-approximation algorithm for the problem, where d denotes
the number of QI attributes in the microdata. The effectiveness and
efficiency of the proposed technique are verified through extensive
experiments.



There exist several promising directions for future work. First,
we plan to improve our three-phase algorithm, to achieve a better
approximation ratio for the star minimization problem. Second,
we have only considered categorical domains in this paper; in the
future we will try to extend our algorithm to support numerical
domains. Finally, it is interesting to investigate the hardness and
approximation algorithms for other privacy principles.

9. APPENDIX

Proof of Lemma 4. Consider the following two cases. If initially
h(Q_i, v) ≤ h(Q_i), then the algorithm must not have removed
any tuple of SA value v; hence h(Q'_i, v) ≤ h(Q_i, v)
trivially. If h(Q_i, v) > h(Q_i) initially, then the algorithm
will reduce the number of tuples of SA value v in Q_i to exactly
h(Q_i), i.e., h(Q_i, v) = h(Q_i). We argue that h(Q'_i) ≤ h(Q_i),
which implies h(Q'_i, v) ≤ h(Q'_i) ≤ h(Q_i) = h(Q_i, v). Assume by
contradiction that h(Q'_i) > h(Q_i). Consider the set Q''_i such that
h(Q''_i, v) = min{h(Q'_i), h(Q_i, v)} for all v, where h(Q_i, v) is
taken before any removal; i.e., Q''_i is obtained from the initial Q_i
by reducing the number of tuples of each SA value v to
h(Q'_i), whenever h(Q_i, v) > h(Q'_i). Note that we must have
Q'_i ⊆ Q''_i. Since Q'_i is l-eligible and h(Q'_i) = h(Q''_i), Q''_i must
also be l-eligible. Thus, the algorithm would have stopped earlier
at Q''_i, whose pillar height is higher than that of Q_i, reaching a
contradiction. □

Proof of Corollary 1. Let Q'_1, ..., Q'_s, R' be an optimal
solution. Since Q'_i is l-eligible, by Lemma 4 we have h(Q'_i, v) ≤
h(Q_i, v). Summing over all v, we have |Q'_i| ≤ |Q_i|, and thus
|R'| = n − Σ_{i=1}^{s} |Q'_i| ≥ n − Σ_{i=1}^{s} |Q_i| = |R|. If the
algorithm stops after phase one, R is also l-eligible, and
Q_1, ..., Q_s, R is a valid solution; hence |R| = |R'|. □

Proof of Corollary 2. Let Q'_1, ..., Q'_s, R' be an optimal
solution. By Lemma 4, h(Q'_i, v) ≤ h(Q_i, v) for any v. Summing
over all i, we have Σ_{i=1}^{s} h(Q'_i, v) ≤ Σ_{i=1}^{s} h(Q_i, v). Since

    h(R', v) + Σ_{i=1}^{s} h(Q'_i, v) = h(R, v) + Σ_{i=1}^{s} h(Q_i, v),

we have h(R', v) ≥ h(R, v); in particular, h(R') ≥ h(R). Since R' is
l-eligible, OPT = |R'| ≥ l · h(R') ≥ l · h(R). □

Proof of Lemma 5. The lemma is equivalent to the claim that phase
two never picks a pillar of R to move tuples to. Indeed, if a pillar
p of R is picked in some iteration, then there must be an alive
QI-group Q such that h(Q, p) > 0. Since Q is l-eligible, it contains
at least l different SA values, i.e., h(Q, v) > 0 for each of those
values v. As Q is alive, by definition, all the SA values in Q are
alive. On the other hand, R has at most l − 1 pillars; otherwise, R is
l-diverse, and the current iteration should not have started. Hence,
there should be at least one alive SA value that is not a pillar in R.
So the algorithm should have picked that value instead of p (recall
that each iteration selects the least frequent alive SA value in R). □

Proof of Lemma 6. Let R_1 denote the residue set R at the end of
phase one. Since the algorithm did not stop after phase one, R_1 is
not l-eligible, implying |R_1| < l · h(R_1). As h(R) = h(R_1)
throughout phase two (Lemma 5), the algorithm will stop as soon as
|R| reaches l · h(R_1). In each iteration of phase two, we remove at
most l tuples together, since a thin l-eligible QI-group has at most
l pillars. Therefore, |R| exceeds l · h(R_1) by at most l − 1 when
the algorithm terminates. □

Proof of Lemma 7. Assume for contradiction that p is a conflicting
pillar in all Q_i. Since R is not l-eligible, we have

    |R| < l · h(R, p).    (3)

For any i, since Q_i is thin and has p as one of its conflicting pillars,
we have

    |Q_i| = l · h(Q_i, p).    (4)

Summing (4) over all i and adding (3), we have

    n < l · h(T, p),

where h(T, p) represents the total number of tuples with SA value
p in the microdata T. This contradicts the assumption that T
is l-eligible. □

Proof of Corollary 4. If R has only one pillar, then all Q_i can only
conflict with R on this pillar, contradicting Lemma 7. □

Proof of Theorem 2. If the algorithm terminates in phase one, then
the theorem follows from Corollary 1. Otherwise, it must terminate
during phase two, due to Corollary 4 and the fact that if R has
at least two pillars, then it must be 2-eligible. Then the theorem
follows from Corollary 3. □

Proof of Lemma 8. Let Q_1, ..., Q_s, R be the status at the
beginning of a particular round. We know that all the Q_i's are dead, and
|R| < l · h(R). Thus Lemma 7 still holds on Q_1, ..., Q_s, R, i.e.,
for any pillar p of R, there exists a QI-group in which p is not a
conflicting pillar. In other words, p is covered by at least one C(Q).
Thus, the greedy algorithm will pick at most l − 1 QI-groups before
it finishes (R has at most l − 1 pillars). Afterward, each pillar of
these QI-groups will ship a tuple to R. We distinguish between two
cases. For a pillar p of R, h(R, p) increases by at most l − 2 since
there is at least one QI-group in which p is not a pillar. For other
SA values v of R, h(R, v) increases by at most l − 1. But since
h(R, v) ≤ h(R) − 1, these other SA values will not cause h(R) to
increase by more than l − 2, either. □

Proof of Lemma 9. Define the gap for R to reach l-eligibility as
Δ(R) = l · h(R) − |R|. The algorithm will terminate as soon as the
gap reduces to zero or negative. At the beginning of phase three,
we have

    Δ(R) = l · h(R) − |R| < l · h(R).    (5)

Next we consider how much the gap reduces in each round. Suppose
in the first step of a round, the greedy algorithm picks r
QI-groups. Following the same reasoning as in the proof of Lemma 8,
h(R) increases by at most r − 1. On the other hand, for any
QI-group Q picked by the greedy algorithm, its pillar height h(Q)
decreases by one. In the second step of this round, we remove tuples
from Q until it becomes thin again, meaning that a total of l tuples
(including those removed in the first step) must have been removed
in this round. Hence, |R| has increased by at least l · r tuples
in this round. So the net effect is that Δ(R) must have decreased
by at least l · r − l(r − 1) = l.

Combining with (5), we conclude that the total number of rounds
is at most Δ(R)/l < h(R), where R is taken at the beginning of
phase three. □

Proof of Theorem 3. Let R be the final set of removed tuples at
the end of phase three, and let R_1 and R_2 denote the residue set at
the end of phase one and of phase two, respectively. By Lemmas 8
and 9, we have

    h(R) ≤ h(R_2) + (l − 2) · h(R_2) = (l − 1) · h(R_2).

Since in the second step of each round of phase three, we remove at
most l tuples together and the algorithm terminates as soon as |R|
reaches l · h(R), we have

    |R| ≤ l · h(R) + l − 1.

Also note that h(R_2) = h(R_1) (Lemma 5), so we have

    |R| ≤ l(l − 1) · h(R_1) + l − 1.

By Corollary 2, OPT ≥ l · h(R_1), so we can bound the approximation
ratio as

    |R| / OPT ≤ (l(l − 1) · h(R_1) + l − 1) / (l · h(R_1)) < l,

hence the proof. □
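The last inequality in the proof rests on a short piece of arithmetic: writing h for the pillar height in the denominator, (l(l − 1)h + l − 1) / (l · h) = (l − 1) + (l − 1)/(l · h), which stays below l whenever h ≥ 1. The snippet below is a brute-force sanity check of that step with exact rationals, offered as an illustration rather than a proof; `ratio_bound` is a hypothetical helper name.

```python
from fractions import Fraction

def ratio_bound(l, h):
    """Upper bound on |R| / OPT from the proof, as an exact rational."""
    return Fraction(l * (l - 1) * h + l - 1, l * h)

# Check the bound against l on a grid of small parameter values.
holds = all(ratio_bound(l, h) < l
            for l in range(2, 50)
            for h in range(1, 50))
```

The bound is largest for h = 1, where it equals l − 1/l, so it approaches but never reaches l.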



